In this project, we are digging into the relationship between human activity and weather in New York City. Our three driving questions are:
For our first project, we analyzed daily weather patterns from data collected at a weather station in Central Park, New York City, and made available online by the National Oceanic and Atmospheric Administration. Through our analysis, we confirmed that there was a statistically significant rise in daily maximum temperatures in Central Park over the last 122 years.
We performed an ANOVA test on daily maximum temperature values over different periods of time and found statistically significant differences between our samples. This led us to create linear models for the change in daily maximum temperature over time, revealing statistically significant warming at an average rate of about 0.025 degrees Fahrenheit per year from 1900 to 2022. This is in fact a larger increase than the average global warming trend reported by NOAA’s Climate.gov (an average of 0.014 degrees Fahrenheit per year; Lindsey & Dahlman, 2022). However, since 1982, average temperatures in Central Park have increased significantly less than average global warming, perhaps because much of the development in New York City took place during the first half of the century.
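As a rough illustration of the ANOVA step, the sketch below computes a one-way F statistic by hand. The three samples and their names (`early`, `middle`, `late`) are hypothetical stand-ins for TMAX values from different periods, not our actual data, and the p-value lookup is omitted.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical daily maximum temperatures (degrees F) from three periods.
early = [60.0, 62.0, 61.0, 59.0]
middle = [62.0, 63.0, 64.0, 61.0]
late = [64.0, 66.0, 65.0, 63.0]

f_stat = one_way_anova_f([early, middle, late])
print(f"F = {f_stat:.2f}")
```

A large F relative to the F distribution with (k - 1, n - k) degrees of freedom indicates that the period means differ by more than within-period noise would explain.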
We had more questions about relationships between weather and human activity, which are explored here in our Final Project.
For this project, we looked more directly at correlations between human activity and weather by incorporating new datasets related to population, air quality, crime (shootings and arrests), the stock market, and COVID-19.
Emily will re-do linear regression looking at measures of local and global human activity as regressors rather than year. She might also look into variable transformations (i.e., linear models fit to polynomials of regressors) to see if the response is best fit as linear or polynomial.
At the end of our exploratory data analysis, we developed a linear model of maximum daily temperature over time, with year as a linear regressor. This revealed a statistically significant increase in average maximum temperatures over time. However, we do not suspect that time itself is the cause; rather, something else that has changed over time has caused the warming in New York. We wanted to explore correlations with other, more direct proxies for human activity.
Our original fit used year as a numerical regressor and month as a categorical regressor. The resulting fit has an \(R^2\) value of 0.775 and a slope of 0.025 degrees Fahrenheit per year, with all fit parameters’ p-values well below 0.05. The different intercepts for each level of the categorical variable (the twelve months of the year) indicated that January is the coldest and July the hottest month in Central Park, with an average difference in maximum daily temperature of approximately 46 degrees Fahrenheit in any given year over this window.
The two extremes and their linear models are plotted in the following figure.
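To make the structure of this model concrete, here is a minimal sketch of a fit with one shared slope and a separate intercept per month. It uses within-month centering, which yields the same least-squares estimates as regressing on year plus month indicators. The function name `fit_common_slope`, the use of years since 1900, and the synthetic January and July values are all assumptions for illustration.

```python
from collections import defaultdict

def fit_common_slope(rows):
    """Fit tmax = intercept[month] + slope * year by least squares.

    rows: iterable of (year, month, tmax) tuples.
    Returns (slope, {month: intercept}).
    """
    by_month = defaultdict(list)
    for year, month, tmax in rows:
        by_month[month].append((year, tmax))

    sxy = sxx = 0.0
    centers = {}
    for month, points in by_month.items():
        xbar = sum(x for x, _ in points) / len(points)
        ybar = sum(y for _, y in points) / len(points)
        centers[month] = (xbar, ybar)
        # Pool the centered cross-products and squares across months.
        sxy += sum((x - xbar) * (y - ybar) for x, y in points)
        sxx += sum((x - xbar) ** 2 for x, _ in points)

    slope = sxy / sxx
    intercepts = {m: ybar - slope * xbar for m, (xbar, ybar) in centers.items()}
    return slope, intercepts

# Synthetic data: years since 1900, warming at 0.025 degrees F per year.
rows = [(y, "Jan", 38.0 + 0.025 * y) for y in range(5)]
rows += [(y, "Jul", 84.0 + 0.025 * y) for y in range(5)]

slope, intercepts = fit_common_slope(rows)
print(round(slope, 3), round(intercepts["Jul"] - intercepts["Jan"], 1))
```

The gap between two month intercepts is the model's estimate of the typical temperature difference between those months in any given year.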
Do other weather variables correlate to TMAX?
We have found a reasonable linear model for temperature over time, but can we look instead at the connection to human activities, rather than time? Can we use some aspect of human activity as a regressor and generate a reasonable model?
We looked to the Census for U.S. population data, but that is only reported decennially, so we looked for other sources. We found historical data back to 1960 for New York state online at https://www.macrotrends.net/cities/23083/new-york-city/population. Because this source is not known to us, we validated it against decennial census data.
A bunch of linear models…
The Air Quality Index (AQI) is used for daily reporting of local air quality. It tells us how clean or polluted the air is, and what associated health effects might be a concern for the public. The higher the AQI value, the greater the level of air pollution and the greater the health concern. Outdoor concentrations of pollutants such as ozone, carbon monoxide, nitrogen dioxide, sulfur dioxide, and particulate matter (PM2.5 and PM10) are measured at stations across New York City and reported to the EPA. The daily AQI is calculated based on these concentration values and stored within the EPA’s Air Quality System database.
Changes in urban life correlate with changes in air quality within that urban area. Sources of emissions such as traffic and the burning of fossil fuels for energy generation can cause air quality to deteriorate. Emissions can also contribute to global warming by releasing more greenhouse gases into the atmosphere, thus increasing average temperatures. As more people migrate to urban areas, we will continue to see a deterioration in air quality unless mitigation measures are taken. Our goal for integrating this data is to study the effects of weather patterns on air quality, and to statistically verify changes in air quality over time in New York City.
The dataset contains about 7,000 observations collected from January 2000 to October 2022.
We start by looking at the distribution of our variable of interest: AQI.
From the histogram above, we can see that the distribution is slightly right-skewed. The right skew is caused by days with unusually high AQI values. Given the large number of observations in our dataset, we rely on the central limit theorem and treat the sampling distributions of our model estimates as approximately normal.
The year-over-year growth rate was also calculated based on yearly average AQI and is depicted in the line plot below.
We can see an alternating pattern of increase and decrease in average AQI from year to year between 2000 and 2009. After 2009, the alternating pattern is broken, but the year-to-year variation continues.
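The growth-rate calculation can be sketched as follows. This is only an illustration: `yoy_growth` is a hypothetical helper, the yearly averages are made up, and we assume growth is expressed as a percent change from the previous year's average.

```python
def yoy_growth(yearly_avg):
    """Percent change in the yearly average relative to the previous year.

    yearly_avg: dict mapping year -> average AQI for that year.
    Returns a dict mapping each year (except the first) to its growth rate.
    """
    years = sorted(yearly_avg)
    return {
        curr: 100.0 * (yearly_avg[curr] - yearly_avg[prev]) / yearly_avg[prev]
        for prev, curr in zip(years, years[1:])
    }

# Made-up yearly average AQI values.
example = {2000: 80.0, 2001: 72.0, 2002: 81.0}
growth = yoy_growth(example)
for year, pct in growth.items():
    print(year, f"{pct:+.1f}%")
```

An alternating pattern like the one described above shows up as growth rates whose signs flip from one year to the next.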
In order to evaluate correlation between weather and air quality, we combined our dataset with the NYC weather data based on the date value in each. Dates without a matching air quality measurement are dropped. Subsequent models will be built using this merged dataframe.
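The merge described here behaves like an inner join on date. A minimal sketch, assuming each dataset is keyed by an ISO date string; the function name, field names, and values are hypothetical:

```python
def inner_join_by_date(weather, air_quality):
    """Keep only dates present in both inputs and combine their fields.

    weather, air_quality: dicts mapping an ISO date string to a dict of fields.
    Returns a list of merged rows sorted by date.
    """
    shared_dates = sorted(weather.keys() & air_quality.keys())
    return [{"date": d, **weather[d], **air_quality[d]} for d in shared_dates]

# Hypothetical records; 2000-01-02 has no AQI reading and is dropped.
weather = {
    "2000-01-01": {"TMAX": 40},
    "2000-01-02": {"TMAX": 42},
    "2000-01-03": {"TMAX": 37},
}
air_quality = {
    "2000-01-01": {"AQI": 55},
    "2000-01-03": {"AQI": 61},
}

merged = inner_join_by_date(weather, air_quality)
print(merged)
```

Because unmatched dates are dropped rather than filled, the merged frame can be shorter than either input.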
The first step to building linear models is assessing correlation between numerical variables in the data. Because the year variable in our dataset begins at 2000, the intercept of a model fit on raw year values would describe year 0, far outside our data, and would be hard to interpret. Therefore, we shifted the variable to start at 0 (continuing to 22 to represent 2022); this changes the intercept but leaves the slope unchanged.
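A quick check of what re-basing the year variable does to a fitted line: in the sketch below, with made-up yearly AQI values and a hypothetical `fit_line` helper, the slope is identical for raw and shifted years and only the intercept moves.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, ybar - slope * xbar

# Made-up AQI values for five consecutive years.
years = [2000, 2001, 2002, 2003, 2004]
aqi = [90.0, 88.0, 87.0, 84.0, 83.0]

raw_slope, raw_intercept = fit_line(years, aqi)
shifted_slope, shifted_intercept = fit_line([y - 2000 for y in years], aqi)

print(raw_slope, raw_intercept)          # intercept refers to year 0
print(shifted_slope, shifted_intercept)  # intercept refers to year 2000
```

After the shift, the intercept reads directly as the predicted AQI in the first year of the dataset.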
The correlation is evaluated via a pairs plot, which depicts the correlation coefficient between numerical variables, and includes scatterplots of their relationships. The pairs plot uses the Pearson correlation method.
From the Pearson pairs plot above, we can see a moderately high negative correlation between year and AQI. This indicates that AQI has been dropping over the years, meaning better air quality in the city.
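For reference, the Pearson coefficient shown in each panel of a pairs plot can be computed as in this sketch. The function name and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up values: AQI falls as the (shifted) year increases.
year = [0, 1, 2, 3, 4]
aqi = [90.0, 88.0, 87.0, 84.0, 83.0]

r = pearson_r(year, aqi)
print(f"r = {r:.3f}")
```

Values near -1 or 1 indicate a strong linear relationship; the sign gives its direction.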
To better observe the effects of year on AQI, we can visualize the yearly average AQI.
The line plot confirms the correlation value we observed in the pairs plot. The average yearly AQI is indeed decreasing as year increases. Next, we build a linear model using year as a regressor to estimate daily AQI.
The results of our linear model reveal significant values for both the intercept and the year coefficient. The coefficient for the year regressor indicates that for each additional year, the predicted daily AQI decreases by about 1.78 points. This agrees with the negative correlation we saw earlier between these two variables. The p-value of the F-statistic is also significant, but the \(R^2\) value of the model is only 0.28: year explains just 28% of the variability in daily AQI measurements, which is a weak fit. Looking at the scatterplot of the relationship can help explain why.
As we can see, there is a high degree of noise when observing daily
AQI values at the yearly level. Although the plot displays a slight
downward trend in daily AQI, the scatter around that trend is large.
This helps explain the results we received from our linear model.
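For reference, the 28% figure for the year-only model follows from the standard definition of \(R^2\). The notation here is ours, not taken from the model output: \(y_i\) is an observed daily AQI value, \(\hat{y}_i\) is the corresponding fitted value, and \(\bar{y}\) is the mean daily AQI.

\[
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]

For a model with an intercept and a single regressor, \(R^2\) equals the square of the Pearson correlation \(r\) between the regressor and the response, so \(R^2 = 0.28\) corresponds to \(|r| = \sqrt{0.28} \approx 0.53\).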
Can we add more or different predictors to improve the fit? In our first
project, we looked at linear trends of TMAX over time and found a
slight positive trend over the years 1900-2022. We also
used month as a categorical regressor to help explain the variance
in daily maximum temperatures. Based on those results, we concluded that
seasonal trends had a negative impact on model performance. Perhaps
seasonality also plays a part in daily AQI measurements.
To refresh our memories, we included the monthly average daily maximum temperature. A seasonal trend can be observed as temperatures increase during summer months and decrease during winter months.
Plotting the average AQI by month, we observe seasonal trends. AQI values are generally high during winter and summer months, but relatively low for the months in between.
Based on this, we modify our last model and attempt to fit
seasonality by adding month as a categorical regressor,
along with our variable of interest from the last project, TMAX.
The regression coefficient for TMAX is significant and positive, with each one-degree Fahrenheit increase raising the predicted AQI by about 0.68 points. The regression coefficients for all month categories are also significant. In fact, every other month has a lower predicted AQI than January when TMAX is held constant. September exhibits the largest difference, with a predicted AQI almost 44 points lower than January!
The p-value of the model’s F-statistic is also significant,
indicating a significant relationship between our chosen predictors and
the daily AQI value. However, the \(R^2\) for our model is only
0.149, which is weaker than our previous model. This
indicates that only about 15% of the variation in daily AQI can be explained
by TMAX and month.
The VIF scores for all regressors are in an acceptable range; however, the fit is still poor. From our results, we conclude that these linear models do not adequately represent data with seasonal trends.
It seems that, due to the seasonal nature of our time-series data, we cannot properly model daily AQI using linear regression. Perhaps a classification technique can be used to address the seasonal trends. More precisely, we can build a kNN model to classify the month based on daily AQI and maximum temperature values.
We start with plotting the relationship between our chosen predictors and add a layer to discern month within the plot.
We can make out minimal distinction of month from the scatterplot above, but the model will provide a more detailed analysis.
The first step involves scaling and centering our predictor values, as they are recorded in very different units of measurement. We also need to split our dataset into training and testing frames. We used a 4:1 (80/20) train-to-test split to satisfy this requirement.
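The preparation step can be sketched as below. This is an illustration under stated assumptions, not the exact procedure we ran: the helper names, the seed, and the sample rows are hypothetical, and fitting the scaling parameters on the training rows only is a design choice of this sketch.

```python
import random
from statistics import mean, stdev

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test); 0.2 gives a 4:1 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(features):
    """Return one (mean, sample standard deviation) pair per column."""
    return [(mean(col), stdev(col)) for col in zip(*features)]

def apply_scaler(features, params):
    """Center and scale each column using previously fitted parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in features]

# Hypothetical (AQI, TMAX in degrees F, month) observations.
rows = [(55, 38, "Jan"), (61, 41, "Jan"), (48, 52, "Mar"), (44, 60, "Apr"),
        (50, 71, "May"), (72, 84, "Jul"), (80, 88, "Jul"), (66, 79, "Aug"),
        (39, 68, "Sep"), (58, 45, "Dec")]

train, test = train_test_split(rows)
train_x = [row[:2] for row in train]
test_x = [row[:2] for row in test]

params = fit_scaler(train_x)                # learned from training rows only
train_scaled = apply_scaler(train_x, params)
test_scaled = apply_scaler(test_x, params)  # reuse the training parameters
print(len(train), len(test))
```

Reusing the training parameters on the test rows keeps information about the test set out of the preprocessing.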
To find the optimal k-value, we evaluated the model over a range of k from 1 to 21. Based on the plot above, it seems 13-nearest neighbors is a decent choice as it provides the greatest improvement in predictive accuracy before the incremental improvement trails off. We can build the kNN model using 13 as the k-value.
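The search over k can be sketched as follows: a minimal hand-written kNN classifier using Euclidean distance and majority vote, evaluated at several values of k. The data points are hypothetical, already-scaled (AQI, TMAX) pairs with only two months, and the sketch sweeps k over 1, 3, and 5 rather than the 1 to 21 range described above.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, point, k):
    """Label of the majority class among the k nearest training points."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    # Ties go to the label seen first, i.e. the closest neighbor's side.
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_x, train_y, p, k) == y
               for p, y in zip(test_x, test_y))
    return hits / len(test_y)

# Hypothetical scaled (AQI, TMAX) points labeled by month.
train_x = [(-1.0, -1.2), (-0.8, -1.0), (-1.1, -0.9),
           (1.0, 1.1), (0.9, 0.8), (1.2, 1.0)]
train_y = ["Jan", "Jan", "Jan", "Jul", "Jul", "Jul"]
test_x = [(-0.9, -1.1), (1.1, 0.9)]
test_y = ["Jan", "Jul"]

scores = {k: accuracy(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
print(scores)
```

Plotting these scores against k gives the kind of curve used to choose the number of neighbors.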
The overall accuracy of our model is a relatively weak value of 0.257. This indicates that AQI and TMAX are not good predictors of month.
## Multi-class area under the curve: 0.644
A multi-class ROC evaluation on the test labels yields an AUC value of 0.644, which is higher than expected based on the model’s accuracy value. Still, this falls short of the 0.8 threshold we use to indicate good discrimination.
We found statistically significant correlations in models using weather, air quality, and human activity data from NYC, but none of our models demonstrate high predictive potential.
Challenges with data: the data were not ideal for testing our initial hypotheses, so we can neither reject nor accept them. We would need better data!
Cannot model seasonality with linear regressions. Potentially need to use seasonality-adjusted time-series models for better outcomes.
What data would enable us to answer the question if we had more time?
Our hypothesis that a correlation would exist between daily weather and air quality variables was ultimately not well supported. We observed trends of declining AQI over time, but our model explained too little of the variance to be considered a good fit. Similarly, a linear model predicting AQI based on the categorical month variable, along with TMAX, also resulted in a poor fit.
We determined that the relationship between air quality and global warming is difficult to model using linear techniques due to seasonal trends in the variables. Our attempt to model the effect of these trends using kNN also resulted in a poorly fitting model. Ultimately, a different type of model would be required to address the seasonal component.
Also, changes in climate are slow to take effect. An increase in emissions does not necessarily lead to increases in temperature on the same time scale. All these effects would need to be taken into consideration for an effective analysis.
Our hypothesis that a relationship exists between daily weather and local human activity, as measured by crime, stock market, and public health data, was supported in some areas. Both crime and stock market trade volume had statistically significant correlations with daily weather variables in our linear models. Crime is correlated with both temperature and precipitation, while stock market trade volume is related to temperature. However, neither of these correlations is strong enough to be predictive.
The relationship between crime and weather was the strongest in this analysis. This represents a valuable area for future study in the context of the changing weather patterns explored in our earlier project.
There were notable limitations to the methods in this analysis. One key limitation that affected the analysis of public health was the availability of essential data. The COVID-19 case data was based on the dates when positive cases were confirmed, rather than the dates when individuals were tested. Because test and confirmation dates are not always the same, this limited our ability to assess relationships that existed on the day of an individual’s test. Looking for alternative data sources to explore this relationship would be an interesting area for a future project.
These are only some of the important questions that should be asked about the relationship between humans and weather. As climate change continues, it is important to understand how it may affect individuals at all levels, from the global scale down to the local level.
Lindsey, R., & Dahlman, L. (2022, June 28). Climate change: Global temperature. Climate.gov. Retrieved December 11, 2022, from https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature
United States Environmental Protection Agency. (n.d.). Air data basic information. Retrieved December 8, 2022, from https://www.epa.gov/outdoor-air-quality-data/air-data-basic-information
NYC Environment and Health Data Portal. (2021, April 12). Tracking changes in New York City’s sources of air pollution. Retrieved December 6, 2022, from https://a816-dohbesp.nyc.gov/IndicatorPublic/beta/data-stories/aq-cooking/